Markup is Markup
Abstract
This position paper makes several simple points that locate the treatment of Web pages within the spectrum of approaches to the texts that are normally used by the linguistically sophisticated language processing community. The first point is that what ostensibly makes the Web a special kind of text, html markup and links, is just a variation, albeit a very populist one, on text types that much of the community has been working with for a long time. Two other points deal with specific techniques that the author has used in his own work with such texts, which seem to be particularly effective at providing a graded syntactic and semantic analysis of unconstrained texts with markup.

* The work reported in this paper was done in 1993-95 while the author was self-employed. The opinions expressed here are his own and not necessarily those of Gensym Corp.

1. The Impact of Markup

The most important point to make is that the truly significant difference is not between Web pages and other types of text but between texts that include markup, the Web among them, and texts that do not. All of our prior experience with markup of any sort carries over to the Web; we are not starting afresh. Taking the chance of restating the obvious, let me begin here by reviewing what constitutes markup, and then discussing some of the benefits and complications of doing linguistic analyses for content using texts with markup.

1.1 Different sorts of markup

Generally speaking, markup is the addition of information to a text that identifies its parts and their relationships to each other while not contributing to the text's wording or conventional content. The original notion is probably 'printer's marks': character sequences that book publishers introduce into the electronic form of a manuscript to indicate to the typesetting devices when a region is to appear in bold, when the font shifts, alterations in leading, and so on. The idiosyncrasy and specificity of purpose of this sort of markup, however, make it not particularly well suited to NLP programs, something that is all too familiar to the people who have had to interpret and strip such marks from electronic dictionaries.

From the early days of professional computing we have marked up our documents with commands directed at formatting programs ranging from troff to TeX. This document itself is being written using invisible (wysiwyg) markup maintained by Microsoft's Word program. This markup can be made explicit, and thereby accessible by a program, by having the document written out using RTF, "rich text format" (as opposed to writing it out using one of Word's own custom formats, or in 'plain ascii', or passing it through a postscript compiler for printing). The author has worked with texts (documentation) in ascii that incorporated standard markup derived by rule from the RTF version of the originals by a program written by Bran Boguraev. While RTF and its equivalents in the PC world are at least as awkward as printer's marks, they still are a means of having a document with markup as the starting point of one's analysis rather than just a stream of tokens and whitespace.

Standard markup in the form of sgml (html is just an instantiation of sgml) is far and away the simplest sort of markup to process (because of the uniformity of its angle-bracket notation) while simultaneously being the most expressive (because it is an open-ended standard).
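To see why the uniformity of the angle-bracket notation matters, consider how little machinery it takes just to separate tags from the text they annotate. The following Python fragment is a minimal sketch only, with invented tag names; a real sgml processor must also honor the DTD, entities, comments, and marked sections.

```python
import re

# One pattern suffices to split angle-bracket markup from running text.
# Illustrative only: real sgml processing also needs the DTD, entity
# expansion, comments, and marked sections.
TOKEN = re.compile(r'<[^>]+>|[^<]+')

def tokenize(marked_up_text):
    """Yield ('tag', body) or ('text', body) pairs in document order."""
    for match in TOKEN.finditer(marked_up_text):
        piece = match.group(0)
        if piece.startswith('<'):
            yield ('tag', piece[1:-1].strip())
        else:
            yield ('text', piece)

# Invented example using semantic, Edgar-style element names.
for kind, body in tokenize('<company>Acme Corp.</company> filed its <form>10-K</form>.'):
    print(kind, repr(body))
```

No comparably simple tokenizer exists for printer's marks or RTF, where every control word carries its own interpretation rules.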
In contrast with the other forms of markup, the thrust of sgml as a standard is to encourage the delimitation and naming of text regions (markup) according to semantic criteria rather than just as a means of constraining their presentation in some viewer. In the NLP community, we have seen simple instantiations of semantic markup since the first LDC corpus distributions, and more recently with the advent of efforts to construct semantically tagged tree banks and with the results formats of MUC-6 (which were marked to delimit company and person names, dates, and so on). It is also beginning to appear on the Web. The SEC Edgar site with its data on corporate financial filings (www.sec.gov/edgarhp.htm), for example, looks odd to the uninitiated, since it comes up as plain text interspersed with notations in angle brackets. These notations are of course sgml tags, and they provide enough of a structural overlay to the text to permit it to be processed for content by very simple programs.

Unfortunately, even though sgml has the longer history and greater flexibility, html is what everyone has and is what virtually everyone uses. This can lead to the regrettable practice of encoding semantic information in patterns of html tags, a problem that will be touched on below. There is no conflict in principle between having markup for both content and form. With the right software (which doesn't appear to exist commercially, though it would be nice to be shown wrong), the DTD of an sgml instantiation could be annotated with corresponding html tags. This would allow simple Web pages to be produced automatically from the sgml documents. (These pages would have no links, or just those constructable by rule.) With adaptations to the browsing code we could then arrange for us or our programs to access the original, semantically tagged document files just as easily as we today can access the html files.
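To make the rule-driven production of html pages concrete, here is a small Python sketch of the idea. The element names and the mapping below are invented for illustration and are not drawn from any real filing DTD; in practice the mapping would come from annotations on the DTD itself.

```python
import re

# Hypothetical mapping from semantic element names to the presentational
# html a rule-driven renderer might emit.  Both the names and the mapping
# are inventions for illustration.
SEMANTIC_TO_HTML = {
    'filing-title': 'h1',
    'company':      'b',
    'period':       'i',
    'item':         'p',
}

def render_by_rule(tagged_text):
    """Rewrite each semantic tag as its html counterpart, defaulting to <span>."""
    def swap(match):
        closing, name = match.group(1), match.group(2).lower()
        return '<%s%s>' % (closing, SEMANTIC_TO_HTML.get(name, 'span'))
    return re.sub(r'<(/?)([A-Za-z][\w-]*)>', swap, tagged_text)

print(render_by_rule('<filing-title>Annual Report</filing-title>'
                     '<item><company>Acme Corp.</company> results for '
                     '<period>fiscal 1996</period> ...</item>'))
```

The semantically tagged file remains the document of record; the html is a disposable view generated from it.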
1.2 Focusing analysis

Here and in the rest of this paper I am taking the goal of the enterprise to be information extraction, which means the NLP analysis program will be answering questions such as: what is this page about; or, if that can't be determined, are there identifiable phrasal fragments that would give a better guess at the topic than an isolated word search; and, if the topic is one that we have a model for, what is this page saying. Vastly many other goals for applying NLP to the Web are conceivable and practical, but I will restrict the discussion here to just IE.[1]

Part of what makes the Web fascinating and important is that it is a populist medium where virtually anyone can be a content publisher, not just those with specialized technical skills or authors going through established, editorially controlled channels (regular newspapers or magazines). This means that we have content from grade school kids and the SEC all on the same network. (I would venture that even the SEC would not have a substantial Web presence if it were not so easy to do.) Given that range, any IE application for the Web will have to deal with texts that aren't pre-selected by topic and will of necessity end up giving graded, partial results in most cases.[2] This is enormously difficult to do in a text that has no markup, but becomes more plausible as markup is added, and even more so to the extent that the markup indicates semantic rather than presentational types.

The primary thing that this markup is adding is a capability to focus or narrow the analysis program's attention region by region through the text. A simple but quite serious case where focus is important is knowing when the text region being analyzed corresponds to a title. The motivation for this is that one of the more successful things one can do in the shallow semantic analysis of open texts is to identify which text sequences correspond to names. Given the name of a company, a person, a place, a movie, a rock group, etc., that name can then be made available to a search engine; the page it is part of can be correlated with other pages making the same reference or constellation of references, either specifically or by type; or, given good techniques for following subsequent references, the sections of the page can be grouped to provide excerpts, or a derived page can be produced with the names pulled out as links, or the original page republished with the names marked up.

However, the standard technique for identifying new names depends on the capitalization of the words in the name as the means of delimiting it (see McDonald, 1993), and the capitalization convention in titles is completely different. If we know we are processing a title (e.g. we are within the title or h1 through h6 tags) we can turn off the capitalization-driven name finder and avoid substantial and confusing false positives. Consider, for example, the title "Sears to Sell Mail-Order Line" (Wall Street Journal, 1/14/97). Assume that the analyzer has a reasonable lexicon for mergers & acquisitions and so will recognize the word "sell", leaving the segment "Mail-Order Line". The body of the article and our common-sense knowledge of the phrase 'mail order' reveal that this segment is a descriptive noun phrase rather than the name of a company ("... would sell its Fremans' mail order unit to ..."), but how is a name-finder application (which cannot, realistically, have a vocabulary large enough to include phrases for every type of product and service) to know that that capitalized sequence isn't the equivalent of "North American Van Lines", which is a company?

When analyzing news articles (a non-trivial part of the Web), one can expect to see every person or company that is mentioned in a title repeated with their full names in the body of the article. Therefore, if the markup allows a title to be reliably distinguished from a body (and thereby ignored), the set of candidate names made available to other processes will be markedly more reliable.
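The focusing move just described, suppressing the capitalization-driven name finder inside title regions, can be sketched in a few lines of Python. The class and the regular expression below are illustrative stand-ins, far cruder than the delimitation technique of McDonald (1993); the example sentences simply reuse the phrases discussed above.

```python
import re
from html.parser import HTMLParser

# Regions whose capitalization convention defeats the name heuristic.
TITLE_TAGS = {'title', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

# Crude stand-in for a capitalization-driven name delimiter: runs of two or
# more capitalized (possibly hyphenated) words.
CAPITALIZED_RUN = re.compile(r'\b(?:[A-Z][\w-]*)(?:\s+[A-Z][\w-]*)+')

class NameCandidateFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title_depth = 0      # > 0 while inside a title-like region
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag in TITLE_TAGS:
            self.title_depth += 1

    def handle_endtag(self, tag):
        if tag in TITLE_TAGS and self.title_depth:
            self.title_depth -= 1

    def handle_data(self, data):
        if self.title_depth:      # focus: turn the heuristic off in titles
            return
        self.candidates.extend(CAPITALIZED_RUN.findall(data))

finder = NameCandidateFinder()
finder.feed('<html><head><title>Sears to Sell Mail-Order Line</title></head>'
            '<body><p>North American Van Lines, by contrast, really is '
            'the name of a company.</p></body></html>')
print(finder.candidates)
```

On this input the title-case run "Sell Mail-Order Line" never reaches the finder, while "North American Van Lines" in the body survives as a candidate.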
Another example of markup providing focus is decoding pronouns within list structures. In the documentation I worked with (which originated as RTF and was then rendered into presentation-oriented sgml), a standard pattern was to introduce an operation within a paragraph ...

[1] I am also personally interested in generating pages dynamically. For instance, in an application currently underway I am generating pages for the transcripts and requirements audits of undergraduates, which includes some pages that are generated on the fly in response to queries such as what courses are still available for satisfying a particular requirement given what courses from that group the student has already taken. In the future I expect to take up the problem of dynamically generating summaries.

[2] This happens to be precisely the application model for the original Tipster program: use powerful information retrieval to winnow the full set of texts into small sets by topic and then hand those texts off to semantically specialized, sublanguage-specific language comprehension systems (Information Extraction systems) that put the references and relationships that they have identified into database format for use by other reasoning or tabulating applications.